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ABSTRACT 

Diverse life forms are driven by the evolution of gene 
regulatory programs including changes in regulator 
proteins and c/s-regulatory elements. Alterations of 
c/'s-regulatory elements are likely to dominate the 
evolution of the gene regulatory networks, as they 
are subjected to smaller selective constraints 
compared with proteins and hence may evolve 
quickly to adapt the environment. Prior studies on 
c/s-regulatory element evolution focus primarily on 
sequence substitutions of known transcription 
factor-binding motifs. However, evolutionary models 
for the dynamics of motif occurrence are relatively 
rare, and comprehensive characterization of the evo- 
lution of all possible motif sequences has not been 
pursued. In the present study, we propose an algo- 
rithm to estimate the strength of purifying selection of 
a motif sequence based on an evolutionary model 
capturing the birth and death of motif occurrences 
on promoters. We term this measure as the 'evolu- 
tionary retention coefficient', as it is related yet 
distinct from the canonical definition of selection co- 
efficient in population genetics. Using this algorithm, 
we estimate and report the evolutionary retention co- 
efficients of all possible 10-nucleotide sequences 
from the aligned promoter sequences of 27748. 
orthologous gene families in 34 mammalian species. 
Intriguingly, the evolutionary retention coefficients 
of motifs are intimately associated with their func- 
tional relevance. Top-ranking motifs (sorted by evolu- 
tionary retention coefficients) are significantly 
enriched with transcription factor-binding sequences 
according to the curated knowledge from the 
TRANSFAC database and the ChlP-seq data 
generated from the ENCODE Consortium. Moreover, 
genes harbouring high-scoring motifs on their 



promoters retain significantly coherent expression 
profiles, and those genes are over-represented in 
the functional classes involved in gene regulation. 
The validation results reveal the dependencies 
between natural selection and functions of c/s-regu- 
latory elements and shed light on the evolution of 
gene regulatory networks. 

INTRODUCTION 

Diverse life forms are largely driven by conservation and 
variations of the gene regulatory circuits. Recent progress 
in high-throughput technologies such as next-generation 
sequencing platforms and DNA microarrays enables biolo- 
gists to map the regulatory networks and investigate their 
evolution across multiple species. For instance, studies in 
evolutionary developmental biology (EvoDevo) compared 
the gene regulatory networks for animal development and 
discovered conserved cores responsible for body plan for- 
mation and variable modules modifying species-specific 
phenotypes such as the shapes of limbs or wings [e.g., (1,2)]. 

One remarkable feature from the gene regulatory 
networks of multiple species is the conservation of their 
constituent proteins (3). Most proteins possess multiple 
functions (pleiotropic), hence are subjected to tight selective 
constraints. Alterations on protein sequences (e.g., changes 
on the DNA-binding domain of a transcription factor) may 
affect many partners (e.g., changes on the bindings of all 
targets of a transcription factor), thus are likely to be dele- 
terious. In contrast, alterations on «'y-regulatory elements 
have local effects and thus enable the systems to evolve in 
an incremental fashion. Consequently, evolution of non 
protein-coding regions in general and cz's-regulatory 
elements in particular plays a critical role in the evolution 
of the gene regulatory systems. 

Early studies of cw-regulatory element evolution focus 
on identification of conserved transcription factor-binding 
motifs (4) and detection of conserved regions on gene pro- 
moters (5). Sequence conservation alone, however, does 
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not suffice to account for the evolution of gene regulatory 
systems. Comparison of known cw-regulatory elements on 
closely related species indicates high rates of turnover and 
divergence (6-10). These changes may yield gains and 
losses of ra-regulatory elements (19), modify the regula- 
tory programs (1,2) or are accompanied by compensatory 
mutations to maintain stable regulatory programs (11,12). 

Like protein-coding regions, evolution of cw-regulatory 
elements is driven by a variety of mechanisms including 
sequence substitutions (18), gene duplications (13), 
tandem repeat insertions and deletions (14). Cw-regula- 
tory elements are added or deleted on the promoters/en- 
hancers according to these mechanisms. Ideally, a 
complete model for the evolution of c/.v-regulatory 
elements should be based on the models of all individual 
mechanisms for molecular evolution. In practice, mechan- 
isms other than sequence substitutions are hard to formal- 
ize. Consequently, the majority of quantitative models for 
cM-regulatory element evolution are derived from 
sequence substitution processes. Several studies use simu- 
lations to examine the effects of sequence mutations on the 
rates for cw-regulatory element evolution [e.g., (15,16)]. 
Others start with sequence substitution models in popula- 
tion genetics and attempt to identify the cz'.v-regulatory 
elements under selection [e.g., (17-19)]. Despite the 
fruitful outcomes generated from these studies, they 
suffer from two major limitations. First, they focus pri- 
marily on the deviation of observed sequences from a 
known regulatory element (e.g., a transcription factor- 
binding motif) rather than the changes of regulatory 
element occurrence on promoters. Alterations on motif 
counts can be more critical for gene regulation than 
specific sequence variations, as the former modulate the 
number of transcription factors bound on promoters. 
Second, all the current studies only examine a collection 
of known transcription factor-binding motifs. Complete 
characterization of the evolution of all possible motif se- 
quences of a fixed length is lacking. This characterization, 
however, is critical for discovering new regulatory 
elements and comprehending their evolution on genomes. 

Previously, we proposed an evolutionary model and an 
algorithm to quantify the strength of natural selection of a 
motif sequence (20). The evolution of motif occurrence 
was formulated as a birth-death process, whereas the 
rates of motif additions and deletions were derived from 
substitutions of their constituent sequences. The evolu- 
tionary retention coefficient of a motif was defined as a 
penalty to slow down motif death, and the evolutionary 
retention coefficient value maximizing the log likelihood 
of the data was estimated. In the present study, we extend 
this model and evaluate the evolutionary retention coeffi- 
cients of all the 4 10 = 1 048 576 10-nucleotide sequences on 
the promoters of 27 748 orthologous gene families from 34 
mammalian species. Intriguingly, evolutionary retention 
coefficients of the 10-mer sequences are significantly 
associated with the tendency of transcription factor- 
binding events and expression coherence of the genes 
harbouring the motifs. By examining the annotations of 
the top-ranking motifs, we find many of them match the 
GC-rich binding sequences of the transcription factors. 
Furthermore, genes harbouring the top-ranking motifs 



are highly enriched with the processes of transcriptional 
regulation. The results provide a comprehensive picture of 
the evolution of cw-regulatory elements. 

MATERIALS AND METHODS 

Data sources 

Aligned 5kb upstream sequences of 27 748 orthologous 
gene families from 34 mammalian species were extracted 
from the UCSC Genome Browser (21). Supplementary 
Table SI and Figure SI report the names and the phylo- 
genetic tree of the selected species. 

To validate the functional relevance of the high-scoring 
motifs, we downloaded external datasets from the follow- 
ing sources: the consensus motifs of transcription factor- 
binding sequences from the TRANSFAC database (22), 
407 ChlP-seq data files from the ENCODE database (23), 
DNA microarray data of human and mouse tissue gene 
expressions (24) and RNA-seq data of human tissue gene 
expressions (25), the annotations and member genes of 
3201 Gene Ontology (GO) categories (27) and pathway 
information from three databases (28-30). 

Quantifying the strength of natural selection 
of motif sequences 

We define a motif as a collection of sequences with the 
same length. Over time motifs are created, annihilated or 
maintained in a specified region (e.g., a gene promoter) by 
sequence substitutions of the constituting nucleotides. 
Motifs undergoing purifying selection would possess 
slower rates of annihilation than those without selective 
constraints. Accordingly, we quantify the strength of 
natural selection of a motif by comparing the empirical 
distribution of its occurrences over multiple species with 
the one generated by a neutral evolutionary model. The 
evolutionary model of motif occurrences and the algo- 
rithm of evaluating the evolutionary retention coefficients 
of motifs are described below. 

A Poisson process model of sequence substitution 

We adopt the simplest model of sequence substitution 
assuming all nucleotides at all positions and across all 
lineages transition with an equal rate (31). In an infinitesi- 
mal time interval dt, the nucleotide sequence of a position 
transitions to another base with probability Xdt. n s (t) 
denotes the cumulative number of sequence changes at 
time t. The transitions within the time interval [t, t + dt] 
is as follows 

P(n s (t+dt) = (N+l)\n s (t) = N) = Xdt. 
P(n,(t+dt) = N]n s (t) = TV) = 1 - Xdt. 

and n s (t) has a Poisson distribution 
(Xt) N 

P(n s (t) = N\n s (0) = 0) = ^e- Xt . (2) 

The maximum likelihood estimate of X is simply the total 
number of sequence changes divided by the total length of 
the time interval considered. In this work we estimated X 
from the aligned 5kb promoter sequences of the 27748 
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gene families over the 34 mammalian species. For each 
position of the aligned promoters in each gene family, 
we observed the sequences in the terminal nodes (the 34 
extant species) of the species tree and inferred the se- 
quences in the internal nodes (ancestral species) by a 
dynamic programming algorithm (32). We then counted 
the total number of sequence changes along all branches 
of the species tree for all positions and all gene families 
and the total lengths of the time intervals, and calculated X 
accordingly. From the empirical data, X = 0.8371. 

A birth-death model for the evolution of motif occurrences 

A motif M c B lm {B = {A,G,T}) is denned as a collection 
of nucleotide sequences of fixed length /„,. We first consider 
the sequence evolution of /„, consecutive positions. There 
are 4 Zm possible sequences that can occur in this / m -mer 
window, and each sequence s e B 1 '" can be labelled as 
either a member of the motif (s e M) or not (s<£A4). 
These sequences comprise an undirected graph G = (V, 
E), where a node v e V denotes an /„,-mer sequence and 
an edge e = (vi, v 2 ) denotes a sequence pair vi and v 2 dif- 
ferent at one position. The evolution of /,,,-mer sequences 
can be viewed as a Markov random walk on G. In an in- 
finitesimal time interval, a sequence can only transition to a 
neighboring node in G and the rate of transition is XI,,,. 

A motif M constitutes a subset of nodes in G (black 
nodes in the left diagram of Figure 1), while the remaining 
nodes are non-motif sequences (white nodes in the left 
diagram of Figure 1). We are interested in the transition 
rate from non-motif sequences to motif sequences and vice 
versa. With a simplifying approximation, we characterize 
these transitions with two numbers: r$\ as the fraction of 
all non-motif — > motif transitions among all transitions 
from non-motifs, and r 10 as the fraction of all motif — > 
non-motif transitions among all transitions from motifs. 
P A , P c , Pg an d Pt denote the background frequencies of 
the four nucleotides obtained from all promoters of the 34 
species. r 0] and r l0 are calculated by the following 
formulas: 



'01 



rw = 



{(, | ,. i )€& | €X} 



(3) 



Where §(•) is an indicator function and ro(vi, v 2 ) is the 
nucleotide background probability of v 2 at the position 
where V\ and v 2 differ. For instance, <n(AGGC, 
AGTC) = Pt. co(vi, v 2 y& rescale the weights of transitions 
according to the frequencies of the destination sequences. 
For instance, transitions to GC-rich sequences are more 
likely to occur on mammalian promoters, as they are 
over-represented in the CpG islands (33). 

n{t) denotes the number of motif occurrence at time /. In 
an / m -mer window, «(/) e {0,1} as the sequence is either a 
motif or not. The transitions of n(t) hence conform with 
the following equations 



P(n(t+l) = l\n(t) = 0) = Xl„,r Qi dt. 
P(n(t+dt) = 0\n(t) — 1) = Xl m r w dt. 



(4) 



We now extend the analysis to the entire promoter of 
length 4 ^> l m . Suppose motif occurrence at time t is 
n(t) = n and the n occurring motifs are not overlapped. 
Each motif instantiation can be annihilated with a rate 
Xl m r\Q. Hence the 'death rate' on the entire promoter is 
the rate on an / m -mer window multiplied by n: 



P(n(t+dt) 



1 \n(t) = n) = Xl„,r\ondt. 



(5) 



There are 4 — 4?« positions unoccupied by motif se- 
quences, and the maximum number of (possibly 
overlapped) /,,,-mer windows is l s — l m n — l m +l. Each of 
these /m-mer windows can generate a new motif. Hence 
the 'birth rate' on the entire promoter is approximately 
the rate on an /,,,-mer window multiplied by 
4 - l m n - /„,+!: 



P(n(t+dt) = n+l\n(t) = n) = XI,„r oi (l s - l„,n - l m +\)dt. 



(6) 



Equations (6) and (5) specify a birth-death process (34) of 
motif occurrence on a promoter of length 4- The distribu- 
tion P„(t) = P(n(t) — n) of motif occurrences over time 
can be expressed as a system of differential-difference 
equations: 



= Mn/MO - x(0)p 0 (t). 

= X(n - l)P n _ x {t)+n{n+V)P n+l (t) 
X(n) = Xl,„r 0i (l s - l m n - l,„+l). 
ix{ri) — Xl m r w n. 



(X(n)+n(n))P n (t). 



(7) 



The system is illustrated by the right diagram of Figure 1 . 

The aforementioned model assumes that sequences 
randomly drift and henceforth no selective pressure is 
exerted on the evolution of motif occurrence. In 
contrast, purifying selection should penalize decrements 
of motif occurrence. Therefore, the evolutionary model 
of motif occurrence under purifying selection largely re- 
sembles the model for neutral evolution (Equation 7) 
except for a modification of the motif death rate: 



H\ri) = 



H,{n) 



(8) 



The motif death rate //(«) under selection slows down the 
neutral motif death rate /x(n) by a factor s > 1 . We term s 
as the evolutionary retention coefficient of a motif. 
Notably, this definition is related yet distinct from the 
canonical definition of selection coefficient in population 
genetics (35). In population genetics, the selection coeffi- 
cient is the decline of the relative fitness of a selectively 
disadvantageous genotype compared with that of a select- 
ively favoured genotype. In a sufficiently large population, 
the selectively advantageous genotype will appear with a 
higher frequency than that of a genotype without selec- 
tion. In this regard, both the canonical selection coefficient 
and evolutionary retention coefficient aim for capturing 
the strength of purifying selection from the observed geno- 
types. However, despite the common goals shared by the 
two measures, the evolutionary retention coefficient is 
distinct from the canonical selection coefficient in two 
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Figure 1. Left: A sequence space of fixed length as a graph. A node denotes a sequence, and an edge denotes two sequences differing at one position. 
Black nodes are members of a motif and white nodes are non-motifs. Dotted edges denote transitions between motifs and non-motifs. Solid edges 
denote transitions within motifs and non-motifs. Right: The state transition diagram of a birth-death model. State n denotes the count of motif 
occurrence on a promoter. l(n) and m(n) denote the birth and death rates emanating from state n. 



aspects. First, the evolutionary retention coefficient 
bypasses the abstract notion of relative fitness and 
directly tackles the consequences of purifying selection — 
elevation of motif occurrence frequencies. Second, the ca- 
nonical selection coefficient examines the allele frequencies 
of a single site, whereas the evolutionary retention coeffi- 
cient is inferred from the frequencies of motif occurrence 
in a consecutive region of the genome. We will further 
clarify the relation between these two scores in simulation 
studies. 

Estimating the evolutionary retention coefficients 
from empirical data 

We estimated the evolutionary retention coefficient of a 
/,,,-mer motif from the aligned 5 kb promoter sequences of 
34 mammalian species with the following procedures. 
First, we divided a promoter into multiple segments of 
fixed length /, — 30/. Segments with >10% gaps in any 
species were discarded. This partition reduces the 
number of valid terms in Equation (7), hence greatly 
simplifies subsequent estimation. 

Second, we treated humans as the reference species and 
assumed that alterations of motif counts from the reference 
to another species followed the birth-death process. / 
denotes the distance between humans and another species 
x in the phylogenetic tree, no and m the motif counts in the 
segments of humans and species x, respectively. For each 
combination of t (or species x), «o and m, we then counted 
f(t,no,n\), the total number of segments with «o and n\ motif 
instances in the counterparts of humans and species x. 

Third, the log likelihood of motif occurrences can be 
expressed as 

C = E E E/ (? '"°' Jl ) 1 °g /> ("(0 = "i w°) = M «) +C - w 

where C is a constant and P(n(i) = «i|«(0) = n 0 ) denotes 
the conditional probability derived from the birth-death 
model under selection by 6. applying the death rate of 
Equation (8) in Equation (7). Given the relatively short 
length of segments, we only considered motif occurrences 
up to 3 and restricted the terms in Equation (7) to n < 3 
accordingly. For each fixed value of evolutionary reten- 
tion coefficient s, we solved P(n(t) = «i|«(0) = «o) 
numerically by the finite difference method for ordinary 



differential equations. To estimate s that maximizes the 
log likelihood in Equation (9), we used a binary search 
to find the optimal „v over the interval [0,20]. 

Comparison of selection coefficients and evolutionary 
retention coefficients in simulated data 

To elucidate the relation between selection coefficients and 
evolutionary retention coefficients, we simulated haploid 
sequence evolution with varying selection coefficients and 
compared the evolutionary retention coefficients estimated 
from the observed data with the given selection coefficients. 
Given a promoter sequence of fixed length (30 nucleotides) 
and a motif (10 nucleotides), we define the relative fitness of 
the promoter as f = 1 — (2 — min(k,2))er, where k is the 
number of motif occurrence on the promoter and a the 
selection coefficient. The sequence containing > 2 motif in- 
stances possesses the highest fitness, whereas the sequences 
containing 1 and 0 motif instance possess intermediate and 
low fitness, respectively. 

We simulated promoter evolution according to both 
sequence substitutions of individual positions and purify- 
ing selection dictated by motif occurrence. One promoter 
sequence was randomly generated in the first generation. 
In each of the following generations, each sequence 
produced 10 progenies with a Poisson mutation rate of 
0.02 per position. Among the progenies from the same 
cohort, 100 of them were selected with probabilities pro- 
portional to their relative fitness, and the remaining se- 
quences were eliminated. This process of sequence 
substitutions and selection lasted for 100 generations. 
For each selection coefficient, we generated randomly 20 
motif and initial promoter sequences and simulated their 
evolution separately. Finally, we repeated simulations for 
the following selection coefficient values: 0.01, 0.05, 0.1, 
0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.49. 

In each simulation, the coalescent tree of the 100 
observed sequences was recorded. We chose the 
sequence closest to other observed sequences as the refer- 
ence and evaluated their phylogenetic distances according 
to the structures and branch lengths of the coalescent tree. 
The evolutionary retention coefficient of each designated 
motif can be consequently inferred from the simulated 
sequences. 
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An exhaustive evaluation of evolutionary retention 
coefficients of all 10-mers on mammalian promoters 

Using the aforementioned algorithm, we estimated the 
evolutionary retention coefficients of all 4 10 = 1048576 
10-mer sequences on the aligned 5kb promoter sequences 
of 34 mammalian species. To accelerate computations, we 
ran the estimation procedures on two PC cluster systems 
simultaneously. Eight jobs were assigned in parallel to the 
HP DL360 G7 servers containing dual Intel Xeon E5520 
CPUs with 2.27 GHz and 24 GB main memory, and 10 
jobs were assigned in parallel to the HP BL460C servers 
containing Intel Xeon CPUs with 3.16 GHz and 16 GB 
main memory. The total running time was 6048 hours. 
Notably, although the theoretical framework we present 
can handle more general motifs (i.e., a collection of nu- 
cleotide sequences), in this work we only investigate the 
single sequence motifs (i.e., occurrences of a particular 
10-mer sequence), as their evolutionary retention coeffi- 
cients can be exhaustively calculated with limited 
computing resources. 

Functional validation of motif sequences under 
selective pressure 

We incurred the following four tests to validate the 
functional relevance of motif sequences under selection. 

Enrichment of TRANSFAC motifs 

First, we demonstrated that evolutionary retention coeffi- 
cients were correlated with enrichment of transcription 
factor-binding motifs extracted from TRANSFAC (22). 
A non-parametric statistical test was used to evaluate 
the enrichment of transcription factor-binding motifs in 
high-scoring sequences. In brief, all the 104 857 610-mer 
sequences were sorted by their evolutionary retention co- 
efficients in a descending order. 168 397 of these sequences 
matched completely or partially with transcription factor- 
binding motifs in TRANSFAC. We defined i*\(x) over the 
normalized rank x = r -^r e [0,1] resembling the cumula- 
tive distribution function (CDF) of TRANSFAC motifs 
over the sorted 10-mer sequences. 

# (TRANSFAC motifs in sequences 1 [x ■ 4 10 ]) 
Fl ^ ~~ # TRANSFAC motifs in sequences 

(10) 

F[(x) should have a high area under the curve if 
TRANSFAC motifs are enriched in the top-ranking se- 
quences. In contrast, the null hypothesis stipulates that 
TRANSFAC motifs are evenly distributed along the 
sorted sequences and the corresponding CDF is 
Fq(x) = x. The maximum deviation between F\{x) and 
Fq(x) gives rise to a statistic of the Kolmogorov- 
Smirnov test. This method is similar to the Gene Set 
Enrichment Analysis (GSEA) (36). 

Enrichment of protein-binding sites from the 
ENCODE data 

Second, we demonstrated that the top-ranking motifs were 
enriched in the protein-binding sites of the human genome 
reported from the ChlP-seq data generated by the 



ENCODE consortium (23). 407 ChlP-seq data files were 
extracted from the ENCODE website (http://hgdownload. 
cse.ucsc.edu/goldenPath/hgl9/encodeDCC/wgEncodeHai 
bTfbs/). Each file reports the sequences of fragments con- 
taining the binding sites of one protein in one cell type. 
For each motif, we constructed a simple null model 
assuming its occurrence on the entire human genome 
followed a Poisson process. The rate of motif occurrence 
per position is j-, where N denotes the number of motif 
occurrence in the entire genome and L denotes the genome 
length. Suppose in a ChlP-seq file, the total length of frag- 
ments is / and the number of motif occurrence is n. Then 
the Poisson rate of motif occurrence in the designated 
fragments is ?? = ^ and the P- value for motif enrichment is 



*— ' ml 

m=n 

Coherence of expression profiles in human 
and mouse genes 

Third, we showed that human and mouse genes 
harbouring high-scoring motif sequences tended to have 
coherent expression profiles compared with genes 
harbouring low-scoring motif sequences. We used both 
oligonucleotide microarray data (24) and RNA-seq data 
(25) in defining expression levels and co-expression of 
human and mouse genes. For the microarray data, we 
obtained the expression information of human genes and 
mouse genes from the Gene Atlas V2 dataset (http:// 
symatlas.gnf.org/SymAtlas/). This dataset comprises 
oligonucleotide microarray data in 63 human and 58 
mouse normal tissues sampled from animal bodies. We 
assigned the expression data from probe sets to corres- 
ponding Ensembl genes following (37,38). The expression 
levels of a gene in a specific tissue were averaged among 
replicates. For the RNA-seq data, we obtained that of 1 1 
human tissues from GEO Series GSE13652 from the 
University of Toronto (25) (brain/liver/muscle/cerebral 
cortex) and GSE 12946 from MIT (26) (adipose/breast/ 
colon/heart/lung/lymph node/testes). The raw 32-mer 
RNA-seq sequence reads were mapped to human 
genome (Ensembl version v56), and RNA-seq-based 
gene expression levels were calculated according to 
(39,40). 

The expression profile divergence between two genes in 
the human or mouse genome was defined by 1 — r, where r 
is the Pearson's correlation coefficient of expression levels 
across the tissues. In the present study, we specifically 
examined co-expression of genes that are not paralogs or 
genes located on different chromosomes. The chromo- 
somal coordinates and annotations of paralogous rela- 
tionships of human and mouse genes based on Ensembl 
v62 were obtained through BioMart (http://www.biomart 
•org/). 

Functional enrichment of genes harbouring 
high-scoring motifs 

Fourth, we showed that human genes harbouring 
high-scoring motif sequences were enriched with certain 
functional classes. Four sources pertaining to functional 
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information of human genes were extracted: the GO 
categories (27), the curated pathway databases of 
Reactome (28), Biocarta (29) and the NCI-Nature cur- 
ations (30). For a given motif, we extracted the genes 
harbouring the motif on their promoters and calculated 
hyper-geometric P-values of enrichment for each GO 
category and pathway. The enriched functional classes 
for both top-ranking motifs (evolutionary retention coef- 
ficient > 3.0, 231 motifs) and control motifs (231 motifs 
surrounding the median of the sorted list) were reported. 

RESULTS 

Evolutionary retention coefficients are correlated with 
selection coefficients in simulated data 

We first demonstrate the resemblance between canonical 
selection coefficients and motif evolutionary retention co- 
efficients with simulated data. For each of 11 
pre-determined selection coefficient values, we simulated 
the evolution of 20 phylogenies, where each phylogeny 
constituted 100 generations and 100 observed promoter 
sequences in their leaves. We then inferred the motif evo- 
lutionary retention coefficient from the observed promoter 
sequences of each simulated phylogeny. The left part of 
Figure 2 shows the (pre-determined) canonical selection 
coefficients and the (inferred) motif evolutionary retention 
coefficients of the 220 phylogeny instances. Overall, the 
two scores exhibit a positive correlation coefficient 
(r = 0.678). Because direct evaluation of relative fitness 
in a population is often challenging, the high correlation 
between the two quantities indicates that the motif reten- 
tion coefficient is a reasonable measure for the selective 
strength of motifs. 

Summary of evolutionary retention coefficients of 
10-mer motif sequences 

We evaluated the evolutionary retention coefficients of all 
4 10 = 1048576 10-mer sequences among the 5kb pro- 
moters of 27 748 gene families in 34 mammalian species. 
The right part of Figure 2 shows the empirical distribution 
of evolutionary retention coefficients, and Supplementary 
Table S2 reports the sorted evolutionary retention coeffi- 
cients of these sequences. As expected, the majority of 
sequences possess low evolutionary retention coefficients: 
the median value is 0.508 and the scores of about 80% of 
the sequences (834 552 of 1 048 576) are below 1.0. We 
considered the first 231 (0.022%) sequences with evolu- 
tionary retention coefficients > 3.0 as the top-ranking 
motif sequences and employed further analysis to these 
sequences. 

Motifs with high evolutionary retention coefficients 
have slower death rates, thus should exhibit high level 
conservation on promoters. For each motif, we define a 
conservation measure as the probability of its presence on 
the promoter of a mammalian species, conditioned on its 
presence on the orthologous promoter of humans. 
Figure 3 displays the conditional probabilities of motif 
presence of the 231 top-ranking motifs (top panel) and 
231 control motifs (bottom panel) with evolutionary re- 
tention coefficients near the global median and with > 50 



instances on human promoters. The species (indices in the 
horizontal axis) are sorted by their phylogenetic distances 
to humans. All (both high-scoring and control) motifs 
have high level conservation between humans and anthro- 
poid primates (chimpanzees, gorillas, orangutans, 
macaques, indices 2-5) and marmosets (Callithrix 
jacchus, index 6), and the conditional probabilities drop 
abruptly beyond marmosets. For instance, the median 
conditional probabilities between humans and chimpan- 
zees (index 2), orangutans (index 4) and marmosets (index 
6) are 0.861, 0.697 and 0.320 respectively, whereas the 
median conditional probability between humans and the 
Philippine tarsiers (Tarsius syrichta, index 7) drops below 
0.074. However, the top-ranking motifs retain consider- 
ably higher level conservation than control motifs in all 
selected mammals. For instance, the median conditional 
probabilities of the top-ranking motifs between humans 
and guinea pigs (Cavia porcellus, index 14), horses 
(Equus caballus, index 21) and African elephants 
(Loxodonta africana, index 28) are 0.074, 0.138 and 
0.070 respectively, whereas those of the control motifs 
are 0.030, 0.072 and 0.031 respectively. 

Notably, in Figure 3 all but one of the top 23 motifs 
have relatively low level conservation compared with the 
remaining top-ranking motifs. By examining those se- 
quences (Table 1), we found they all belonged to the 
Alu-J repeat elements (41). They exhibit background 
level conservation between humans and most other 
species but undergo a massive number of insertions on 
the promoters of gray mouse lemurs {Microcebus 
murinus, index 8) and small-eared galagos (Otolemur 
garnettii, index 9) (Supplementary Table S3). These inser- 
tions violate the Poisson process model of sequence sub- 
stitutions, increase the counts of f{t,n 0 ,ni) where n\ > 0, 
and therefore elevate the evolutionary retention 
coefficients. 

High-scoring motifs are enriched with transcription 
factor-binding sites 

Transcription factor-binding sites likely accommodate 
some motif sequences under selective pressure. We con- 
firmed the dependency of transcription factor-binding 
sites and evolutionary retention coefficients with two 
external datasets. First, we verified that sequences with 
higher evolutionary retention coefficients tended to 
match the transcription factor-binding motifs reported in 
the TRANSFAC database (22). A simple check on se- 
quences sorted by evolutionary retention coefficients 
provides obvious evidence: 77 of the top 231 10-mer se- 
quences match TRANSFAC motifs, whereas only 35 of 
the 231 sequences in the middle and 22 of the 231 se- 
quences in the bottom of the sorted list match 
TRANSFAC motifs. Supplementary Table S4 reports 
the TRANSFAC match on top-ranking, middle and 
bottom control motifs. 

In addition to observations on small subsets of se- 
quences, we also quantified this dependency on the 
entire sorted list. Denote X a random variable indicating 
the match of sorted sequences with TRANSFAC motifs. 
P(X = x) indicates the probability that a sequence with 
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selection coefficient evolutionary retention coefficient 



Figure 2. Left: The scatter plot of canonical selection coefficients and evolutionary retention coefficients on simulated data. Each point denotes the 
scores obtained from 100 simulated sequences derived from one common ancestor over 100 generations. Right: Empirical distribution of selection 
coefficients among the 4 10 = 1048576 10-mer sequences. The probabilities are displayed in a log scale. 
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species index 



Homo sapiens 



Callithrix jacchus 



Mus musculus 



Oryctolagus cuniculus Equus caballus 



Erinaceus europaeus Dasypus novemcinctus 



Figure 3. Conservation of motif occurrence between humans and another species [.P(motif occurs in a species | motif occurs in humans)] for the top 
231 motifs and 231 control motifs from the middle of the ranked list. The horizontal axis denotes the species index with an increasing distance from 
humans (same as the species order in Supplementary Table SI). The vertical axis denotes the motif index from high selection coefficients (top) to low 
selection coefficients (bottom). The top-ranking and control motifs are separated by a white line. Colours in the heat map denote the levels of 
conditional probabilities between 0 (black) and 1 (bright red). 
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normalized rank x(0 < x < 1) matches TRANSFAC 
motifs. If evolutionary retention coefficients are 
uncorrected with the presence of TRANSFAC motifs, 
then all TRANSFAC motifs should be evenly distributed 
along the normalized ranks, and X follows a uniform dis- 
tribution. Therefore, enrichment of TRANSFAC motifs 
on high-scoring sequences is quantified by the deviation of 
the empirical distribution of X from a uniform 
distribution. 

Figure 4 plots the empirical CDF of X (F x (x) in 
Equation 10) and the CDF of a uniform distribution 
(F 0 (x)). Fi(x) lies above F 0 (x) for all xe[0,l], indicating 
that sequences with high evolutionary retention coeffi- 
cients are more likely to match TRANSFAC motifs 
than those with low evolutionary retention coefficients. 
The P- value of the Kolmogorov-Smirnov test <10~ 325 . 

Second, we showed that the top-ranking motifs were 
enriched in the protein-binding DNA fragments reported 
from the ENCODE data (23). We downloaded 407 files 
from the ENCODE website. Each file reports the 
protein-binding DNA fragments generated by one 
ChlP-seq experiment with a specified antibody and cell 
type. The 407 files cover 59 proteins (transcription 
factors, RNA polymerase II, nucleosome-binding 
proteins, etc.), where multiple ChlP-seq experiments 
with distinct cell types and replicates were undertaken 
for each protein. For each motif, we quantified the signifi- 
cance of its enrichment in an ENCODE file with a null 
model assuming that its occurrence followed a Poisson 
process with a rate r] — ^-, where N denoted the number 
of motif occurrence in the entire genome, L the genome 
length and / the total fragment length in the file. 

We evaluated the enrichment P-values for the 
top-ranking motifs in each ENCODE file. About 12% 
of the motif-file combinations (11 063 of 94017) exhibit 
significant enrichment (.P<10 _20 ). To reduce errors 
generated by individual ChlP-seq experiments, we 
grouped the results of the same proteins together and 
counted the fractions of files in each group with significant 
enrichment. There are 182 motif-protein combinations 
with at least 5 ENCODE files and significant enrichment 
P-values (<10 -20 ) in at least 80% of the constituent 
ENCODE files. The number of enriched motif-protein 
combinations drops considerably in control motifs. 
Among the 231 control motifs in the middle of the 
sorted list, 80 motif-protein combinations retain the 
same level of enrichment. Furthermore, among the 231 
control motifs in the bottom of the sorted list, only 38 
motif-protein combinations retain the same level of en- 
richment. Intriguingly, both TRANSFAC and 
ENCODE data indicate that levels of enrichment shrink 
by half from the top to the middle and from the middle to 
the bottom of the sorted list. 

Table 2 shows the 1 82 motif-protein combinations with 
significant enrichment. Six motifs are enriched in the 
ChlP-seq data of at least 10 proteins. These motifs are 
heavily biased toward GC-rich sequences: GCGCCTGC 
GC (index 27), GGGGCGGGGC (index 33), GCCCCGC 
CCC (index 56), GGGCGGGGCC (index 65), GCGCAT 
GCGC (index 76) and GGCCCCGCCC (index 134). The 
GC-rich sequences match the binding motifs of several 



proteins such as SP1 (42), AP2 (43), NRF1 (44) and 
E2F1 (44). Reciprocally, the ChlP-seq files of seven 
proteins contain at least 10 enriched motif sequences: 
ELF1, GABP, YY1, ERG1, RAD21, POL2 and PAX 5. 
Some of these proteins (such as SP1, AP2, POL2, YY1, 
RAD21) ubiquitously regulate many genes, thus their 
binding motifs yield high evolutionary retention 
coefficients. 

Four motif-protein combinations enriched in 
ENCODE files correspond to exact match with the 
TRANSFAC data. Motifs 33 (GGGGCGGGGC) and 
36 (GGGGGCGGGG) match the SPl-binding motif in 
TRANSFAC. Motif 7 (CCGCCATCTT) matches the 
YY1 -binding motif, and motif 170 (CTTCCTCTTT) 
matches the PU.l -binding motif. 

Genes harbouring high-scoring motifs tend to retain 
functional coherence 

Genes sharing the same protein-binding sequences on 
their promoters are likely co-regulated by the same tran- 
scription factors. Consequently, we expect genes 
harbouring high-scoring motifs to possess functional co- 
herence. We validated this prediction with two tests using 
external data. First, using the method described in (39,40), 
we evaluated the divergence of expression profiles of genes 
from two human expression datasets and one mouse ex- 
pression dataset. The distribution of expression divergence 
on genes harbouring the top 5000 motifs was compared 
with the distribution on the genes harbouring the bottom 
5000 motifs. Intriguingly, genes harbouring the top 5000 
motifs have consistently lower expression divergence than 
genes harbouring bottom 5000 motifs across all three 
datasets. The Wilcox test P-values of the deviation 
between the two gene sets are significant across the three 
datasets: 2.047 x 10 -14 , 2.414 x 10 -4 and 1.670 x 10 -12 
respectively. Furthermore, by ruling out the two 
confounding factors for co-expression — co-localization 
of genes on the same chromosomes and paralogous 
genes sharing the same ancestry — the deviation of 
expression divergence between genes harbouring top 
5000 and bottom 5000 motifs remains pronounced. 
Table 3 reports the significance of the deviation of 
expression divergence between gene pairs harbouring top 
5000 and bottom 5000 motifs. The deviation of expression 
divergence suggests that genes harbouring motifs of high 
evolutionary retention coefficients tend to retain 
functional coherence. 

Second, we extracted the human genes harbouring each 
of the top 231 motif sequences and assessed their over- 
representations in 3201 GO categories and 889 pathways 
from three sources. Supplementary Table S5 reports the 
functional categories and pathways significantly enriched 
(hyper-geometric P < 0.001) with each top-ranking motif. 
There are 45 motif-functional class pairs with significant 
enrichment. In contrast, there are only 9 significant motif- 
functional class pairs among the 231 control motifs in the 
middle of the sorted list. 

By examining the functional enrichment results in 
Supplementary Table S5, we found that many top- 
ranking motifs were highly enriched in functional classes 
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related to transcriptional regulation such as nucleosome 
assembly (motif CCAGCTCCAG, P-value 3.264 x 10" 10 ), 
transcription factor activity (motif CCCCTCCCCC, 
P-value 8.773 x 10~ 10 ) and chromatin modification (motif 
CCGCCGCCGC, P-value 6.109 x 10" 6 ). To justify this 
observation, we categorized the GO terms into four 
classes: regulators (transcription factors and signalling 
proteins, 2563 genes), enzymes (6947 genes), transporters 
(1087 genes) and structural proteins (571 genes), and 
evaluated the enrichment P-values of top-ranking and 
control motif targets in each class. Figure 5 reports the 
enrichment of motif targets on the four major categories. 
Strikingly, among the targets of the top 231 motif 
sequences, 20 are significantly enriched with known 
regulators (P < 0.01). In contrast, the targets of only one 
motif are enriched with enzymes, transporters and 
structural proteins, respectively. Furthermore, among the 
targets of the 231 control motifs in the middle of the 
sorted list, only 3 have significant enrichment in regulators 
and none has significant enrichment in other classes. The 
results suggest that regulators tend to harbour motifs under 
stronger selective pressure on their promoters. 

Motif sequences with high evolutionary retention 
coefficients are derived by diverse causes 

Beyond statistical validations on the motif sequences 
sorted by evolutionary retention coefficients, we also 
examined the individual top-ranking motifs and annotated 



them with known regulatory sequences or repeat elements. 
Table 1 reports the functional annotations of the top 231 
motifs. Several remarkable features emerge from the 
annotations. First, 26 10-mers constitute two blocks of 
13 consecutive nucleotides (TAGCTCACAGCAACCTC 
AAACT and AGTTTGAGGTTGCTGTGAGCTA, 
respectively). These two blocks match exactly the Alu-J 
repeat elements (41). As shown in Figure 2, they have 
background level conservation between humans and 
other species but undergo a massive number of insertions 
in gray mouse lemurs and small-eared galagos. Second, 
another ten 10-mers constitute a block of 19 consecutive 
nucleotides (TGCAGCAGCTGCTGCTGCT). Unlike 
Alu-J this block does not hit human repeats or gene 
sequences with significant blast E-values. This block 
largely coincides with many binding sites of MITF and 
AP4 according to the cisRED database of genome-wide 
regulatory module and element predictions (44). Third, 
three 10-mers (motifs 24, 26, 28) are three phases of the 
GCC-repeat sequences, and they coincide with many 
binding sites of ERG1 and DEAF1 according to 
cisRED. Fourth, four 10-mers (motifs 82, 32, 37, 38) 
form a 13-nucleotide consecutive block (CTCTGATTG 
GCTG) and coincide with NF-Y binding sites. Three 
additional 10-mers also coincide with NF-Y binding 
sites. Fifth, seven 10-mers are dominated by Cs and Gs 
(motif 27, GCGCCTGCGC; motif 33, GGGGCGGGGC; 
motif 36, GGGGGCGGGG; motif 56, GCCCCGCCCC; 
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Table 3. Comparison of the distributions of expression divergence 
between gene pairs harbouring top 5000 motifs and bottom 5000 
motifs 



Wilcox test 


Human 


Human 


Mouse 




62-tissues 


1 1 -tissues 


58-tissues 




Affymetrix 


RNAseq 


Affymetrix 


All possible 


2.047 x 10 14 


2.414 x 10~ 4 


1.67 x 10- 12 


pairs 
Inter-chromosomal 


< io- 300 


9.565 x 10~ 5 


8.138 x 10~ 12 


pairs 
Non-paralogous 


3.042 x 10 14 


5.669 x 10~ 5 


1.141 x IO" 12 



pairs 



The Wilcox test P-values between the two distributions are shown. The 
three rows report the /'-values derived from different criteria for 
selecting gene pairs: all possible gene pairs, gene pairs on distinct 
chromosomes and non-paralogous gene pairs. The three columns 
report the P-values derived from three datasets of mRNA expressions: 
Affymetrix data of 62 human tissues, RNAseq data of 1 1 human tissues 
and Affymetrix data of 58 mouse tissues. 



motif 65, GGGCGGGGCC; motif 70, CCCCGCCCCC; 
motif 134, GGCCCCGCCC), and most of these motif 
sequences coincide with the binding sites of SP1, AP2 
and ERG1. Sixth, two 10-mers (motif 7, CCGCCATCT 
T; motif 53, GCCGCCATCT) form a consecutive block 
(GCCGCCATCTT) and largely coincide with the binding 
sites of YY1 (45). 



DISCUSSION 

Evolution of cw-regulatory elements is an essential and 
critical aspect of the evolution of the gene regulatory 
systems. Prior models and studies focus primarily on the 
sequence evolution of selected known cw-regulatory 
elements but do not characterize the evolution of all 
possible regulatory sequences. In this work, we propose 
a model to quantify the strength of purifying selection for 
motif sequences of a fixed length and estimate the 
evolutionary retention coefficients of all 10-mer sequences 
from the aligned promoters of 34 mammalian species. A 
series of validation tests confirm the functional relevance 
of the proposed evolutionary retention coefficients. High- 
scoring motifs are enriched with transcription factor- 
binding sites according to curated information from 
TRANSFAC and ChlP-seq experimental data from 
ENCODE. Furthermore, genes harbouring high-scoring 
motifs retain more coherent expression profiles in human 
and mouse and are over-represented in the functional 
categories and pathways involved in transcriptional 
regulation. 

Many high-scoring motif sequences are bound by 
regulatory proteins with versatile or prevalent functions: 
POL2, SP1, YY1, RAD21 and AP2. POL2 encodes the 
RNA polymerase II that interacts ubiquitously with 
DNAs. SP1 encodes a zinc-finger transcription factor 
involved in many cellular processes, including cell 
differentiation, cell growth, apoptosis, immune responses, 
response to DNA damage and chromatin remodelling. 
YY1 encodes a ubiquitously distributed transcription 
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regulator enzyme structure transporter 



Figure 5. Enrichment of four functional classes — regulators, enzymes, structural proteins and transporters — among the genes harbouring the top- 
ranking and control motifs. The horizontal axis denotes the four functional classes. The vertical axis denotes the motif index from high selection 
coefficients (top) to low selection coefficients (bottom). The top-ranking and control motifs are separated by a yellow line. Colours in the heat map 
denote the magnitudes of log 10(hyper geometric P-values) from —6 (bright red) to 0 (black). 



factor belonging to the GLI-Kruppel class of zinc finger 
proteins. RAD21 encodes a nuclear protein involved in 
the repair of DNA double-strand breaks, as well as in 
chromatid cohesion during mitosis. AP2 (TFAP2A) 
encodes a transcription factor that interacts with 
enhancer elements. Furthermore, the high-scoring motifs 
interacting with these proteins are highly biased toward 
GC-rich sequences. Ubiquitous presence of these motifs 
on many target genes probably accounts for their high 
evolutionary retention coefficients. In contrast, binding 
motifs of the transcription factors with small numbers of 
specific targets will not exhibit high evolutionary retention 
coefficients, as there are only a few instances on 
promoters. 

A bias toward abundant sequences among the high- 
scoring motifs may be relieved by adjusting the rates of 
the birth-death process to fit the background frequencies 
of motifs. Currently, the rates of drifting into and out of a 
motif depend primarily on the relative volume of the motif 
and frequencies of single nucleotides in sequence space (r 0 i 
and r 10 ). To further reduce this bias, we can adjust the 
weights of transitions in Equation (3) according to the 
background frequencies of motifs in the entire genomes. 

The birth-death model of neutral evolution (Equation 
7) is based on the simplest model of sequence substitution 
assuming all nucleotides transition with an equal rate. 



This assumption is incongruent with various observed 
biases in sequence evolution. For instance, on single 
nucleotides, the rates of transitions (purine — > purine or 
pyrimidine — > pyrimidine) are higher than those of 
transversions (purine — > pyrimidine or pyrimidine — > 
purine). On dinucleotides, the mutation rates of CpG — > 
TpG tend to be higher as methylated cytosines deaminate 
to form thymines. The current model can be extended to 
incorporate these biases with a price of complexity. In an 
extended version, the two parameters specifying relative 
transition rates from motifs to non-motifs and vice versa 
(r 0 i and r 10 ) depend not only on motif complexity but also 
on its constituting sequences. Transversions between 
purines and pyrimidines are penalized while substitutions 
from CpG to TpG are rewarded. For the computational 
cost for implementing and running this extended model, 
we decided to leave this task in the future work. 

Conservation of motif occurrences can be viewed as an 
aggregation of two factors: (i) the fraction of functional 
motif instances among all motif occurrences and (ii) the 
level of conservation among the functional motif 
instances. The function of a motif in eukaryotic genomes 
depends on various contextual factors such as the presence 
of other transcription factor-binding sites and enhancers, 
nucleosome positions and chromatin configurations. 
Hence only a fraction of motif occurrences are likely to 
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be functional and are subjected to selective constraints. 
Moreover, among these functional instances, differential 
levels of conservation may be manifested on distinct sites. 
Some regulatory subsystems are less tolerant with 
dysregulation and thus undergo a stronger selective 
pressure, whereas others may be more flexible and thus 
retain a lower level of conservation. Disentangling these 
two factors from sequence data alone is very challenging, 
as the contextual information often cannot be determined 
by sequences. Alternatively, functional information such 
as ChlP-seq data can provide additional clues about the 
first factor. By examining the overlaps of selected motifs 
and transcription factor-binding sites, we can estimate the 
fraction of functional motif instances. The level of 
conservation of a few transcription factor-binding motifs 
has been investigated in the prior studies mentioned in 
Introduction. 

It is puzzling that the top-ranking motifs are over- 
represented on the promoters of regulators (transcription 
factors and other DNA-binding proteins, signalling 
proteins) but not on other functional categories 
(enzymes, transporters, structural proteins). The results 
suggest that the regulatory circuits of regulators possess 
elements with high selective constraints, whereas those of 
other proteins do not. We provide two speculations to 
explain this observation. First, many regulators are 
involved in processes with pervasive impacts such as 
chromatin modification, nucleosome assembly and 
multicellular organismal development. Constituent genes 
of these processes are thus subjected to tighter selective 
constraints and are regulated by motifs with high 
evolutionary retention coefficients. Second, many motifs 
overrepresented in regulators are the GC-rich binding 
sequences of the aforementioned proteins. Regulatory 
programs of regulators may result from the combinatorial 
interactions between these generic motifs and other 
process-specific motifs. In contrast, the regulatory 
programs of other proteins may be dominated by 
process-specific motifs. Further experimental data and 
analysis are required in order to verify these speculations. 

In the present study, the log likelihood function is 
obtained by comparing motif occurrences between a 
reference species (human) and all the other species 
(Equation 9). This is not an exact form of the joint log 
likelihood function, as it assumes the motif occurrences of 
other species are independent conditioned on human data 
and ignores their dependencies owing to a shared 
phylogeny. A more accurate form is to sum up the log 
conditional probabilities along all branches of the 
phylogenetic tree: 

C = E E Ef(t,n 0 n 1 )logP(n bl (t) = n 1 |n bl (0)= n 0 )+C. 

bleT,l(bl)=t n 0 n, 

(12) 

where bl denotes a branch in the phylogenetic tree T with 
length l(bl) = t, n b i(0) and n b i(t) denote the motif 
occurrences at initial and terminal time points of the 
branch, respectively. Motif occurrences on ancestral 
nodes of the phylogenetic tree are not directly observed. 
Hence a dynamic programming algorithm can be applied 



to either reconstruct the maximum likelihood promoter 
sequences of ancestors or marginalize over the probability 
distributions of all possible sequence configurations (32). 
However, this accurate formulation requires 
reconstruction of 5kb upstream sequences (or their 
probability distributions) of 27 748 orthologous gene 
families from 34 mammalian species. Owing to its 
tremendous computational cost, we decided to implement 
the simplified approximation for an exhaustive screening 
of all 10-mer motif sequences on all orthologous gene 
families. For subsequent studies on specific motifs and 
selected gene families, the accurate version in Equation 
(12) should be adopted. 

In the present study, we consider each motif as one 
unique 10-mer nucleotide sequence. The formulation of 
the motif evolutionary model, however, does not impose 
this restriction. A motif can be a collection of sequences 
represented by one or multiple strings of 15 IUPAC 
symbols (e.g., the TP53 binding motif is 
NGRCWTGYCY, where R denotes A or G, W denotes 
A or T, Y denotes C or T and N denotes any base). The 
choice of investigating single sequence motifs is based on 
the concern of computational efficiency. Exhaustive 
evaluations of selection coefficients of all composite 
motifs are beyond the computing capacity accessible by 
the authors. For instance, there are 15 10 = 5.767 x 10 11 
motifs represented by 10-mers of IUPAC symbols 
without gaps. The number of these composite motifs is 
54994 folds as the number of 10-mer single sequence 
motifs, which would take about 333 million CPU hours 
using the current computing infrastructure. This number 
will grow exponentially when combinations of 10-mer 
IUPAC strings and gaps are considered. Therefore, 
simplifications without exhausting all possible sequences 
are required when extending the analysis into composite 
motif sequences. 
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